Abstract- This paper describes a prototype system for the
intelligent classification of electrolaryngograph (EGG) signals
in order to provide an objective assessment of voice quality in
patients at different stages of recovery after treatment for
larynx cancer. The system extracts salient short-term and
long-term time-domain and frequency-domain parameters from
EGG signals taken from male patients steadily phonating the
vowel /i/. The quality of these voices was also independently
assessed by a Speech and Language Therapist (SALT)
according to their 7-point ranking of subjective voice quality.
These data were used to train and test a Multi-layer
Perceptron (MLP) neural network to classify EGG signals in
terms of voice quality. Several MLP configurations were
investigated using various combinations of these signal
parameters, and the best results were obtained using a
combination of short-term and long-term parameters, for
which an accuracy of 92% was achieved. It is envisaged that
this system could be used as a valuable aid to the SALT during
clinical evaluation of voice quality.
I. Introduction

Electrolaryngography (EGG) measures the impedance of the
electrical signals transmitted through the living tissue
surrounding the larynx in the neck [1]. Earlier work has
shown that there are structural differences in the EGG
signals of stationary vowels for normal and pathological
voices (Fig. 1), and that parameters derived from the EGG
signals can be used to train a Multi-layer Perceptron (MLP)
neural network to classify the signals as normal or
pathological with an accuracy of 80% [2].

Fig. 1. Electrolaryngograph signals and frequency spectrum for males
phonating /i/: pathological voice a) and b), normal voice c) and d)

Whilst this system provided good classification between
normal and abnormal voice quality, its feature set limited
it to sub-optimal classification results, since it is well
known that some pathologies are measured more easily
using long-term (>50 ms) parameters [3]. This paper
describes the refinement of the MLP approach to objective
voice quality assessment by introducing long-term features
to the classification system.

In addition, the system has been extended to provide a
graded classification of pathological voices in line with the
7-point ranking scheme used by Speech and Language
Therapists (SALT) to assess voice quality. At present,
SALTs endeavor to rehabilitate a patient's voice back to
normality, or as near normal as possible, quickly following
treatment. Their ranking scheme (0=least abnormal, 6=most
abnormal) is based on a variety of sound parameters, some
of which are well defined, such as shimmer and jitter, while
others, such as whisper and creak, are descriptive or have
tenuous physical correlates. As a result, the assessment is
largely subjective and depends upon the experience of the
SALT. The intelligent voice quality system described here
aims to improve this situation by providing accurate,
reproducible, graded measures of a patient's voice quality to
help the SALT plan the patient's rehabilitation more
accurately.

II. Data capture

The data used to develop the system was captured
off-line under clinical conditions at the Christie and Withington
Hospitals in Manchester, using an Electrolaryngograph
PCLX system. This system is used to capture the
electrolaryngograph signals using pads placed either side of
the neck. Acoustic signals were also recorded using a
microphone, but have not been used at this stage. Both the
EGG and acoustic data channels were captured
synchronously at 20 kHz with 16-bit Analog-to-Digital
converters for up to 3 seconds while the subject phonated
the vowel /i/ as steadily as possible.

Although speech data was recorded for both male and
female patients, the largest pathological group was male, so
it is these speech signals that were used in the study. For
each patient the SALT made a subjective voice quality
assessment using a 7-point ranking.
III. Data processing

A voicing analysis was performed upon each 3-second
EGG signal to determine if the subject had voiced during
phonation. Voicing occurs when the vocal folds are
vibrating, and as a result, the signal contains a fundamental
frequency, f0. If the signal was considered to be voiced, it
was initially processed to extract the long-term features, and
then the short-term features for classification of voice
quality. The long-term features are mean fundamental
frequency, Mf0, standard deviation of f0, SDf0, and
percentage of the signal that is voiced, V+, while the
short-term features include parameters related to the structure of
the first few harmonics, and the glottal noise.
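As a rough illustration of the three long-term features, the sketch below summarises a list of per-frame f0 estimates for one recording. The function name, the use of `None` to mark unvoiced frames, and the choice of the population form of the standard deviation are assumptions made for the example; the paper does not specify them.

```python
from statistics import mean, pstdev

def long_term_features(frame_f0):
    """Summarise per-frame f0 estimates for one recording.

    frame_f0 holds one entry per analysis frame: the f0 estimate in Hz for
    a voiced frame, or None for an unvoiced frame (a hypothetical encoding).
    Returns (Mf0, SDf0, V+), with V+ expressed as a percentage.
    """
    voiced = [f for f in frame_f0 if f is not None]
    if not voiced:
        return None  # no voiced frames, so there is nothing to summarise
    mf0 = mean(voiced)
    sdf0 = pstdev(voiced)  # population SD; an assumption, see lead-in
    v_plus = 100.0 * len(voiced) / len(frame_f0)
    return mf0, sdf0, v_plus

# Four frames, one of them unvoiced.
features = long_term_features([120.0, 122.0, None, 118.0])
```

Here `features` holds Mf0, SDf0 and V+ in that order, computed over the three voiced frames only.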

The voicing test involved taking 50 ms frames from the
signals and applying cepstral analysis techniques [4] to
identify the voiced frames. Each frame was then
pre-emphasised by forward differencing to suppress the effects
of drifting signal amplitude, and its autocovariance
multiplied by a Tukey-Hanning window, prior to
transformation to the frequency domain using the Fast
Fourier Transform. An estimate of f0 for each frame,
deduced during the voicing analysis, was used to derive the
FHN normalised spectral representation [5]. This process
removed any inter-patient variability in f0 and its harmonics,
allowing a more effective modelling of the spectral envelope
among groups of patients. Once the FHN spectrum had been
determined, Gaussians were fitted to the data around f0 and
its first few harmonics. Each Gaussian, Gh (h=0 up to
typically 8) was parameterised as:

$G_h = (\mathrm{position}_h,\ \mathrm{width}_h,\ \mathrm{amplitude}_h)$
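The frame conditioning described in the preceding paragraph (pre-emphasis by forward differencing, then a Tukey-Hanning window applied to the autocovariance) can be sketched roughly as follows. This is only an illustration under stated assumptions: the biased autocovariance estimator, the maximum lag of 200 samples, and the synthetic test frame are all hypothetical choices, and the subsequent Fourier transform and FHN normalisation are not shown.

```python
import math

def forward_difference(frame):
    """Pre-emphasise by forward differencing: y[n] = x[n+1] - x[n]."""
    return [b - a for a, b in zip(frame, frame[1:])]

def autocovariance(x, max_lag):
    """Biased autocovariance estimate for lags 0..max_lag."""
    n = len(x)
    mu = sum(x) / n
    d = [v - mu for v in x]
    return [sum(d[i] * d[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def tukey_hanning(max_lag):
    """Tukey-Hanning lag window: w[k] = 0.5 * (1 + cos(pi * k / max_lag))."""
    return [0.5 * (1.0 + math.cos(math.pi * k / max_lag))
            for k in range(max_lag + 1)]

def windowed_autocovariance(frame, max_lag):
    """Difference the frame, then taper its autocovariance with the window."""
    emphasised = forward_difference(frame)
    acov = autocovariance(emphasised, max_lag)
    return [c * w for c, w in zip(acov, tukey_hanning(max_lag))]

# Synthetic 50 ms frame at 20 kHz (1000 samples): a 125 Hz sine plus slow drift.
fs = 20000
frame = [math.sin(2 * math.pi * 125 * n / fs) + 0.0005 * n for n in range(1000)]
smoothed = windowed_autocovariance(frame, max_lag=200)
```

The linear drift in the synthetic frame becomes a constant after differencing and is then removed by the mean subtraction inside `autocovariance`, which is the effect the pre-emphasis step is meant to achieve.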

It was observed that the mixture of Gaussians
gave a better 'fit' to the FHN spectrum for the less abnormal
patients, and so a parameter related to goodness of fit, called
the Harmonic Linearity Measure (HLM), was calculated for
each frame. Finally, as glottal noise is considered to be an
important measure of voice quality, a parameter, FHNNE,
based on the Normalised Noise Energy (NNE) [6], but
derived from the FHN spectrum, was calculated for the data.

The parameters extracted from the EGG signals and used
for the MLP classification tests comprised 3 long-term
parameters (Mf0, SDf0, V+) and 17 short-term parameters
(G1, G2, G3, G4, G5, HLM, FHNNE). Full details of the data
processing and extraction of these parameters can be found
in McGillion [7].

IV.  Data  classification 

A total of 77 pathological EGG signals were
available for training and testing data. For each of the 7
classes, 450 patterns were used for training/validation and
200 for testing. Unfortunately, as a result of the relatively
small dataset, there were different numbers of patients in
each class. As it is desirable to have equal numbers in each
class to train an MLP adequately, additional frames were
taken from some patients and a small percentage of the data
was artificially generated by adding normally distributed
noise to the short-term features of the existing patterns
within each class.
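The artificial generation step can be pictured with the minimal sketch below, which pads one class up to a target size by copying randomly chosen existing patterns and adding zero-mean Gaussian noise to each feature. The function name, the noise standard deviation and the toy feature values are hypothetical; the paper does not report the noise level or how base patterns were selected.

```python
import random

def balance_class(patterns, target_count, noise_sd, rng):
    """Pad one class up to target_count patterns.

    Each synthetic pattern is a randomly chosen existing pattern with
    zero-mean Gaussian noise added to every feature. noise_sd is a
    hypothetical setting; the source does not give the noise level.
    """
    balanced = [list(p) for p in patterns]
    while len(balanced) < target_count:
        base = rng.choice(patterns)
        balanced.append([v + rng.gauss(0.0, noise_sd) for v in base])
    return balanced

rng = random.Random(0)  # fixed seed so the sketch is repeatable
short_term = [[0.10, 0.52, 0.33], [0.12, 0.49, 0.35]]  # toy feature vectors
augmented = balance_class(short_term, target_count=5, noise_sd=0.01, rng=rng)
```

The original patterns are kept unchanged at the front of the list, so only the padded entries are synthetic.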

A two-layer 7-output MLP was trained using the
back-propagation training algorithm, softmax activation function,
and cross-entropy error function. The advantage of using the
softmax activation function was that the output across all
seven classes sums to 1.0 and can therefore be interpreted as
a probability of membership of each of the seven classes,
assuming equal prior probabilities. A further constraint
placed upon the MLP is that for any single class to be
declared the 'winner' the output for that class must be greater
than 50% (0.5). MLP structures with different numbers of
hidden units and subsets of the 21 input parameters were
investigated in order to determine the combination that
provided the minimum classification error.
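The output stage described above can be illustrated with a short sketch of the softmax function, the cross-entropy error for a single pattern, and the 0.5 'winner' rule. The function names and the example activation values are hypothetical and are not taken from the trained network.

```python
import math

def softmax(activations):
    """Map output-layer activations to values that sum to 1."""
    peak = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_class):
    """Cross-entropy error for a single pattern with a one-hot target."""
    return -math.log(probabilities[target_class])

def winner(probabilities, threshold=0.5):
    """Return the winning class index, or None if no output exceeds threshold."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best if probabilities[best] > threshold else None

# Hypothetical activations for the seven output units.
confident = softmax([0.2, 0.1, 3.0, 0.4, 0.0, -0.5, 0.3])
ambiguous = softmax([1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 1.0])
```

With these values `winner(confident)` selects class 2, while `winner(ambiguous)` returns `None` because no output exceeds 0.5, so no class is declared.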

V.  Results  and  discussion 

An overview of the best results obtained from the
different combinations of input parameters and hidden units
is shown in Table 1.

The best input set is the MLP structure whose input
parameters provided the best classification results, regardless
of weight initialisation and the number of hidden units. The
best individual structure is the MLP that provided the best
classification results taking into account the number of
hidden units in the MLP, but disregarding the weight
initialisations. The best individual MLP is that which takes
into account both the number of hidden units and the weight
initialisation.

TABLE 1
TEST RESULTS FOR THE 7-CLASS MLP

Result                     Accuracy (%)  SD    Structure        Inputs
Best Individual MLP        92.00         6.42  20-25-7          G1, G2, G3, G4, G5, FHNNE, HLM, Mf0, SDf0, V+
Best Individual Structure  90.30         1.65  20-40-7          G1, G2, G3, G4, G5, FHNNE, Mf0, SDf0, V+
Best Input Set             87.24         3.47  21-[15,25,40]-7  G1, G2, G3, G4, G5, f0, FHNNE, HLM, Mf0, SDf0, V+

Result                     Measure          Class 0  Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Best Individual MLP        Sensitivity (%)  98.5     96.0     94.5     80.5     86.5     91.5     96.5
                           Specificity (%)  0.12     0.87     0.75     4.12     3.0      2.12     0.75
Best Individual Structure  Sensitivity (%)  98.1     94.6     93.1     86.9     83.6     81.7     94.1
                           Specificity (%)  0.47     1.3      1.42     3.1      3.82     4.17     1.37
Best Input Set             Sensitivity (%)  98.0     93.5     92.2     81.1     77.3     74.8     93.5
                           Specificity (%)  0.45     1.49     1.52     4.02     4.67     5.28     1.25


The best individual MLP can be different from the best
individual structure and the best input set if the variance in
the classification ability of a given MLP is large (due to the
weight initialisation). Although an individual MLP might
produce a very high accuracy, the average
performance can be very poor. Therefore, it is important to
consider all three results when determining the best
performing input set and MLP structure.

Fig. 2. The 20-25-7 MLP classification performance for each of
the 7 classes (panels titled "MLP classification of class N
abnormals"; horizontal axis: class)

The best overall MLP structure was a 20-25-7, using the
parameters [G1, G2, G3, G4, G5, FHNNE, HLM, Mf0, SDf0,
V+], and the results indicate that this MLP was able to
distinguish between the seven abnormal groups with an
accuracy of up to 92%.
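As a quick arithmetic check, the 92% figure is consistent with the per-class sensitivities reported for the 20-25-7 MLP in Table 1, assuming the 200 test patterns per class stated in Section IV. The sketch below is only a consistency check on the published numbers, not a description of how the authors computed the accuracy.

```python
# Per-class sensitivities (%) for the 20-25-7 MLP, classes 0..6, from Table 1.
sensitivity = [98.5, 96.0, 94.5, 80.5, 86.5, 91.5, 96.5]
test_patterns_per_class = 200  # from Section IV

# Correctly classified test patterns in each class.
correct = [round(s / 100.0 * test_patterns_per_class) for s in sensitivity]

# With equal class sizes, overall accuracy is the pooled fraction correct.
total_patterns = test_patterns_per_class * len(sensitivity)
overall_accuracy = 100.0 * sum(correct) / total_patterns
```

Because every class has the same number of test patterns, the pooled accuracy equals the mean of the seven sensitivities.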

Figure 2 shows the output of the MLP for each
pathological class. The output of the MLP is an estimate of
the posterior probability of membership for each class. It
should be noted that there are only two cases where the
output falls below the majority threshold of 0.5, and only one
misclassification (for class 3). Perhaps unsurprisingly, the
classes at the two extremes of the scale, 0 and 6, provide the
best classification results. In all cases, classes 3, 4, and 5 are
the most difficult to discriminate between.

All the short-term features were found to contribute to
the classification. The classification accuracy increased from
26.5% with [G1] alone to 67.7% with [G1, G2, G3, G4, G5].
Adding the other short-term features [FHNNE] and [HLM]
increased the discrimination ability of the MLP to 72.07%
and 68.64% respectively. Similarly, the long-term features
were also found to be very important to the discrimination
between the classes. These parameters [Mf0], [SDf0], [V+]
alone were able to distinguish between the classes with
accuracies of 37.57%, 23.07%, and 26.78% respectively.
However, as can be seen from the results, it is the
combination of the short-term and long-term features that
provides the most accurate classifications of the abnormal
signals.
VI. Conclusion

The results from this work suggest that an intelligent voice
quality assessment system incorporating an MLP neural
network can be trained to provide objective classifications
of voice quality in line with the 7-point ranking scheme
used by the SALT.

However, it should be noted that the MLP has been trained
on the assessments of one SALT, which could lead to
subjectively biased results. The collection of patient speech
data, including voice quality rankings from several SALTs
in the region, is now taking place, and will hopefully provide
a larger and less biased dataset for training the system.

At  the  same  time,  work  is  taking  place  to  identify  and 
evaluate  other  parameters  that  can  be  derived  from  the 
speech  data,  in  particular  the  acoustic  data  which  has  been 
largely  ignored  in  this  study  so  far,  in  order  to  further 
improve  the  accuracy  and  reproducibility  of  the  system. 